Setup

Part 1: Data

We have acquired data about how much audiences and critics like movies as well as numerous other variables about the movies. This dataset includes information from Rotten Tomatoes and IMDb for a random sample of movies.

We’re interested in learning what attributes make a movie popular.

The data set comprises 651 randomly sampled movies produced and released before 2016, each scored on both Rotten Tomatoes and IMDb.

We are dealing with a retrospective observational study: the data were collected with random sampling but no random assignment. This means that we cannot draw causal conclusions from these data. However, we can make statements about association, and our findings can be generalized to the whole population of movies released before 2016 and scored on both Rotten Tomatoes and IMDb.


Part 2: Research question

Without watching a movie, can we say whether the audience loves it?

Obviously, the popularity of a movie is mainly based on its content and the artistic value it offers. Those two factors are very hard, if not impossible, to quantify, though. However, there are certain characteristics of a movie that we can measure and classify comparatively easily, among them genre, duration, and cast quality. Using those and other variables from the data frame, we’ll try to create a prediction model for movie popularity among the audience.


Part 3: Exploratory data analysis

We start our analysis with a high-level overview of the data frame provided. The table below shows the number of movies of the three major types (Documentary, Feature Film, and TV Movie) in the data set, the time range of theatre releases (thtr_rel_year), and the median audience rating on both IMDb (imdb_rating) and Rotten Tomatoes (audience_score). We choose the median instead of the mean for the audience rating because we suspect the rating distributions are skewed, and the median is a more robust statistic for skewed distributions.
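Such a grouped summary can be computed with base R’s aggregate(); a minimal sketch on a toy data frame (hypothetical values standing in for the real movies data set):

```r
# Toy stand-in for the movies data frame (hypothetical values)
movies_toy <- data.frame(
  title_type  = c("Documentary", "Feature Film", "Feature Film", "TV Movie"),
  imdb_rating = c(7.7, 6.1, 6.9, 7.3)
)

# Group by title type and take the median rating; the median is preferred
# here because it is robust to skewed distributions
stats <- aggregate(imdb_rating ~ title_type, data = movies_toy, FUN = median)
print(stats)
```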

title_type total min(thtr_rel_year) max(thtr_rel_year) median(imdb_rating) median(audience_score)
Documentary 55 1970 2012 7.7 86
Feature Film 591 1972 2014 6.5 62
TV Movie 5 1993 2012 7.3 75

Table 1. Movie statistics grouped by title type

We can see that the majority of movies in the data set belong to the ‘Feature Film’ group. 55 movies are Documentaries and only 5 are TV Movies.

It’s expected that the popularity of movies in these three major groups would depend on very different factors. This is partially confirmed by the significant difference between the median rating of Feature Films and the two other groups. Thus, we decided to deal only with the most populous group of movies in this data set, Feature Films. We then create a subset (ff for Feature Film) of the data frame and use it from now on:
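A sketch of that subsetting step, using a hypothetical stand-in for the full movies data frame:

```r
# Hypothetical stand-in for the full movies data frame
movies <- data.frame(
  title_type  = factor(c("Feature Film", "Documentary", "Feature Film", "TV Movie")),
  imdb_rating = c(6.5, 7.7, 5.9, 7.3)
)

# Keep only Feature Films; droplevels() discards the now-empty factor levels
ff <- droplevels(subset(movies, title_type == "Feature Film"))
nrow(ff)
```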

Let’s take a look at the data summary now (we exclude some variables from the summary to make the report shorter):

##                 genre        runtime       mpaa_rating 
##  Drama             :301   Min.   : 68.0   G      : 16  
##  Comedy            : 85   1st Qu.: 93.0   NC-17  :  2  
##  Action & Adventure: 65   Median :104.0   PG     :110  
##  Mystery & Suspense: 59   Mean   :106.6   PG-13  :130  
##  Horror            : 23   3rd Qu.:116.0   R      :317  
##  Other             : 15   Max.   :202.0   Unrated: 16  
##  (Other)           : 43                                
##                               studio    thtr_rel_year  thtr_rel_month  
##  Paramount Pictures              : 37   Min.   :1972   Min.   : 1.000  
##  Warner Bros. Pictures           : 29   1st Qu.:1990   1st Qu.: 4.000  
##  Sony Pictures Home Entertainment: 26   Median :1999   Median : 7.000  
##  Universal Pictures              : 23   Mean   :1997   Mean   : 6.783  
##  Warner Home Video               : 19   3rd Qu.:2006   3rd Qu.:10.000  
##  (Other)                         :451   Max.   :2014   Max.   :12.000  
##  NA's                            :  6                                  
##   thtr_rel_day    dvd_rel_year  dvd_rel_month     dvd_rel_day   
##  Min.   : 1.00   Min.   :1991   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 7.00   1st Qu.:2001   1st Qu.: 3.000   1st Qu.: 7.00  
##  Median :15.00   Median :2003   Median : 6.000   Median :15.00  
##  Mean   :14.48   Mean   :2004   Mean   : 6.304   Mean   :15.06  
##  3rd Qu.:21.50   3rd Qu.:2007   3rd Qu.: 9.000   3rd Qu.:23.00  
##  Max.   :31.00   Max.   :2015   Max.   :12.000   Max.   :31.00  
##                  NA's   :6      NA's   :6        NA's   :6      
##   imdb_rating    imdb_num_votes           critics_rating critics_score   
##  Min.   :1.900   Min.   :   390   Certified Fresh:116    Min.   :  1.00  
##  1st Qu.:5.850   1st Qu.:  6276   Fresh          :172    1st Qu.: 31.00  
##  Median :6.500   Median : 17934   Rotten         :303    Median : 57.00  
##  Mean   :6.387   Mean   : 62861                          Mean   : 54.78  
##  3rd Qu.:7.100   3rd Qu.: 66112                          3rd Qu.: 79.00  
##  Max.   :9.000   Max.   :893008                          Max.   :100.00  
##                                                                          
##  audience_rating audience_score  best_pic_nom best_pic_win best_actor_win
##  Spilled:273     Min.   :11.00   no :569      no :584      no :500       
##  Upright:318     1st Qu.:44.50   yes: 22      yes:  7      yes: 91       
##                  Median :62.00                                           
##                  Mean   :60.47                                           
##                  3rd Qu.:78.00                                           
##                  Max.   :97.00                                           
##                                                                          
##  best_actress_win best_dir_win top200_box
##  no :521          no :548      no :576   
##  yes: 70          yes: 43      yes: 15   
##                                          
##                                          
##                                          
##                                          
## 

The variables of most interest relate to movie ratings: the audience rating on IMDb (imdb_rating), the critics score on Rotten Tomatoes (critics_score), and the audience score on Rotten Tomatoes (audience_score). We can see that the median for all three variables is higher than the respective mean. This supports our earlier assumption about the skewness of the rating distributions. In fact, all three distributions are left skewed. This can also be demonstrated with the following histograms:

Interestingly, the IMDb rating shows a unimodal left-skewed distribution centered around 6.4 (Plot 1), but both scores on Rotten Tomatoes show an almost uniform distribution with a slight left skew and no apparent center (Plots 2 and 3). This discrepancy could be explained by the larger number of voting users on IMDb, which would make the central tendency more pronounced. But this is questionable, as we don’t have data on the number of votes behind the Rotten Tomatoes scores.

We can also conclude that IMDB users are less likely to give a low rating to a movie than Rotten Tomatoes users and critics. Indeed, 75% of movies received a rating of 58.5 (or 5.85 on a scale of 1 to 10) or higher on IMDB, 44.5 or higher from the audience on RT, and only 31 or higher from the critics on RT.

However, the most plausible cause of the difference in distributions lies in the methods used to calculate the rating and scores on the two platforms:

Basically, the Rotten Tomatoes score only counts positive ratings, while the IMDb rating takes all ratings into account.

For example, an audience score of 25% on Rotten Tomatoes can be translated to an IMDb rating between 2.5 and 7.5 depending on individual ratings that made up that 25% score on RT. In other words, in the plots above we do not compare apples to apples (or we should say tomatoes to tomatoes).
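A toy calculation makes this concrete: the same set of individual ratings produces very different numbers under the two schemes (the 3.5-star threshold for a ‘positive’ RT rating is an assumption here):

```r
# Ten hypothetical audience ratings on a 0-5 star scale
stars <- c(4, 3.5, 1, 0.5, 2, 1, 3.5, 1.5, 2, 1)

# RT-style audience score: the share of 'positive' ratings
# (>= 3.5 stars is an assumed threshold)
rt_score <- mean(stars >= 3.5) * 100

# IMDb-style rating: the average of all ratings, rescaled to a 10-point scale
imdb_like <- mean(stars) * 2

rt_score   # 30: only 3 of the 10 ratings count as positive
imdb_like  # 4: the average takes every rating into account
```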

Let’s take a look at the median statistics for feature films grouped by genre:

genre total IMDb RT_critics RT_audience
Action & Adventure 65 6.00 33 52.0
Animation 9 6.40 48 65.0
Art House & International 14 6.50 52 65.5
Comedy 85 5.70 36 49.0
Documentary 3 6.90 40 74.0
Drama 301 6.80 67 70.0
Horror 23 5.90 40 43.0
Musical & Performing Arts 8 7.25 80 82.5
Mystery & Suspense 59 6.50 60 54.0
Other 15 7.00 72 74.0
Science Fiction & Fantasy 9 5.90 67 47.0

Table 2. Feature Films rating and scores medians by genre

We can see that Drama is the most populous category, with more than half of the movies in the data set, a median IMDb rating of 6.80, an RT critics score of 67, and an RT audience score of 70. The highest median IMDb rating of 7.25 is in the Musical & Performing Arts category. This group also has the highest RT scores among both critics and the audience. We need to note that only 8 movies fall into this group, so these statistics could look quite different for a larger sample. The Comedy group has the lowest median IMDb rating of all, 5.70, the second-lowest RT critics score, and the third-lowest RT audience score.

It’s interesting to see the distribution of movie runtimes:

Here we can see a right-skewed unimodal distribution centered around 100 minutes with several outliers.
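A sketch of how such a histogram could be drawn with base graphics, using synthetic right-skewed runtimes in place of the real ff$runtime column:

```r
set.seed(42)
# Synthetic right-skewed runtimes centered near 100 minutes,
# shaped roughly like the real ff$runtime column (hypothetical data)
runtime <- round(rlnorm(573, meanlog = log(100), sdlog = 0.15))

hist(runtime, breaks = 30,
     main = "Feature Film runtimes", xlab = "Runtime (minutes)")
abline(v = median(runtime), lty = 2)  # mark the median
```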


Part 4: Modeling

The data set contains several variables on movies: title, runtime, date of release, production company, cast, nominations, ratings, and scores. We’d love to predict the popularity of the movie (represented by the IMDb rating) based on a certain combination of the rest of the variables using a multiple linear regression method.

We will start with modeling where the response variable is IMDb rating (imdb_rating).

IMDb rating prediction model

Developing a full model as a reference for all future models would be a good first step. The full model includes all potential predictor variables. However, we omit some of the variables, as they are in the data set for informational purposes only and make no sense to include in the statistical analysis.

Here’s a list of variables we can omit:

  • The information in the director and actor1 through actor5 variables was already used to determine whether the movie’s cast includes a director, actor, or actress who has won an Oscar.
  • Variables like imdb_url and rt_url obviously cannot be associated with the popularity of a movie, so they should be omitted.
  • The wording of a movie’s title is by itself meaningless for answering the research question. But we can try using the title length as a predictor.
  • We can omit the title_type variable as we only work with Feature Film here.
  • The actual day of the month of the theatrical (thtr_rel_day) or DVD (dvd_rel_day) release is also meaningless for predicting a movie’s popularity. We can imagine some correlation between the release month or year and movie popularity, though we suspect it is weak, if present at all.
  • We should also omit the critics_rating and audience_rating variables, as they are essentially derivatives of the critics_score and audience_score variables, respectively. They are therefore collinear with the scores, and adding more than one of these variables would not add much value to the model.
  • We suspect collinearity between the best_pic_nom and best_pic_win variables: obviously, a movie can’t win an Oscar without being nominated. We should therefore remove one of them, say, best_pic_win.

The question is whether we should use Rotten Tomatoes scores in predicting the IMDb rating and vice versa. Technically speaking, a popular movie should rate high on both platforms (though this is not always the case). This means that the rating on IMDb should have a positive correlation with the audience score on RT, despite the fact that they are calculated differently. We can check whether this is the case by calculating a correlation coefficient:
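The check itself is a one-liner with cor(); sketched here on hypothetical paired scores standing in for ff$imdb_rating and ff$audience_score (on the real data frame it produces the value shown below):

```r
# Hypothetical paired scores standing in for ff$imdb_rating and ff$audience_score
imdb <- c(5.5, 6.0, 6.5, 7.0, 7.5, 8.0)
rt   <- c(40, 50, 60, 72, 80, 90)

cor(imdb, rt)  # Pearson correlation coefficient
```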

## [1] 0.8487537


Indeed, we can see that the correlation coefficient is very high (~0.85) and positive. However, the variance doesn’t seem to be constant (little variance for popular movies and larger variance for unpopular ones), for the reason discussed in Part 3. There is definitely an association between these two variables, but the relationship is not strictly linear.

The previous assumption is even more apparent with RT critics score and critics rating variables plotted against IMDb audience rating (Plots 10-13):

The funnels we see are a perfect illustration of the non-constant variability described earlier. Thus, we shouldn’t use RT scores for predicting the IMDb rating with multiple linear regression.

The full model should then look like this:

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + 
##     thtr_rel_month + dvd_rel_year + dvd_rel_month + imdb_num_votes + 
##     best_pic_nom + best_actor_win + best_actress_win + best_dir_win + 
##     top200_box, data = na.omit(ff))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8520 -0.4062  0.0699  0.5605  2.0753 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.871e+01  1.701e+01   2.863 0.004358 ** 
## genreAnimation                 -2.379e-01  3.530e-01  -0.674 0.500658    
## genreArt House & International  9.488e-01  2.802e-01   3.386 0.000761 ***
## genreComedy                    -3.070e-02  1.450e-01  -0.212 0.832405    
## genreDocumentary                1.029e+00  5.070e-01   2.030 0.042886 *  
## genreDrama                      6.849e-01  1.247e-01   5.494 6.04e-08 ***
## genreHorror                    -9.927e-02  2.177e-01  -0.456 0.648561    
## genreMusical & Performing Arts  1.186e+00  3.200e-01   3.708 0.000230 ***
## genreMystery & Suspense         4.323e-01  1.625e-01   2.660 0.008038 ** 
## genreOther                      4.802e-01  2.486e-01   1.932 0.053930 .  
## genreScience Fiction & Fantasy -8.811e-02  3.179e-01  -0.277 0.781739    
## runtime                         5.149e-03  2.620e-03   1.965 0.049896 *  
## mpaa_ratingNC-17               -2.281e-01  8.845e-01  -0.258 0.796598    
## mpaa_ratingPG                  -6.047e-01  2.600e-01  -2.325 0.020413 *  
## mpaa_ratingPG-13               -8.087e-01  2.708e-01  -2.986 0.002952 ** 
## mpaa_ratingR                   -5.297e-01  2.628e-01  -2.015 0.044346 *  
## mpaa_ratingUnrated              1.343e-01  3.551e-01   0.378 0.705479    
## thtr_rel_year                  -1.095e-02  4.817e-03  -2.274 0.023364 *  
## thtr_rel_month                  7.542e-03  1.068e-02   0.706 0.480371    
## dvd_rel_year                   -1.062e-02  1.052e-02  -1.009 0.313386    
## dvd_rel_month                   2.041e-02  1.069e-02   1.909 0.056754 .  
## imdb_num_votes                  3.500e-06  3.682e-07   9.504  < 2e-16 ***
## best_pic_nomyes                 3.665e-01  2.040e-01   1.797 0.072954 .  
## best_actor_winyes               1.472e-02  1.040e-01   0.142 0.887487    
## best_actress_winyes            -1.733e-02  1.141e-01  -0.152 0.879318    
## best_dir_winyes                 2.011e-01  1.412e-01   1.424 0.155037    
## top200_boxyes                  -9.112e-02  2.378e-01  -0.383 0.701758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.841 on 546 degrees of freedom
## Multiple R-squared:  0.3896, Adjusted R-squared:  0.3605 
## F-statistic:  13.4 on 26 and 546 DF,  p-value: < 2.2e-16

The full model includes all variables that we believe are meaningful. It has an adjusted R2 of 0.3605 (summary(imdb_full)$adj.r.squared), which means that the model explains approximately 36% of the variance in the response variable (imdb_rating in this case). Let’s see if we can come up with a model of higher predictive power by changing the combination of predictors.

We’ll be using a forward selection technique here. The forward selection model starts with an empty model. Then, we add variables one-at-a-time until we cannot find any variables that improve the model (as measured by adjusted R2).
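A minimal sketch of that selection loop, demonstrated on R’s built-in mtcars data rather than on ff (the response and candidate names below are placeholders, not our model):

```r
# Illustrative forward selection by adjusted R^2 on the built-in mtcars data;
# 'mpg' and the candidate names are placeholders for our response and predictors
response   <- "mpg"
candidates <- c("wt", "hp", "cyl", "drat", "qsec")
selected   <- character(0)
best_adj   <- -Inf

repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break
  # Adjusted R^2 of each one-variable extension of the current model
  scores <- sapply(remaining, function(v) {
    f <- reformulate(c(selected, v), response = response)
    summary(lm(f, data = mtcars))$adj.r.squared
  })
  if (max(scores) <= best_adj) break  # no extension improves the model
  best_adj <- max(scores)
  selected <- c(selected, remaining[which.max(scores)])
}

selected  # variables chosen, in order of addition
best_adj  # adjusted R^2 of the final model
```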

## 
## Call:
## lm(formula = bestformula, data = ff_imdb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8427 -0.4093  0.0707  0.5695  2.0833 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.851e+01  1.688e+01   2.873 0.004222 ** 
## imdb_num_votes                  3.459e-06  3.537e-07   9.779  < 2e-16 ***
## genreAnimation                 -2.238e-01  3.497e-01  -0.640 0.522592    
## genreArt House & International  9.483e-01  2.786e-01   3.404 0.000713 ***
## genreComedy                    -2.578e-02  1.434e-01  -0.180 0.857393    
## genreDocumentary                1.020e+00  5.052e-01   2.019 0.043925 *  
## genreDrama                      6.829e-01  1.228e-01   5.558 4.25e-08 ***
## genreHorror                    -9.543e-02  2.167e-01  -0.440 0.659819    
## genreMusical & Performing Arts  1.193e+00  3.186e-01   3.745 0.000200 ***
## genreMystery & Suspense         4.269e-01  1.594e-01   2.678 0.007637 ** 
## genreOther                      4.683e-01  2.467e-01   1.898 0.058179 .  
## genreScience Fiction & Fantasy -9.631e-02  3.166e-01  -0.304 0.761131    
## thtr_rel_year                  -1.062e-02  4.774e-03  -2.225 0.026497 *  
## mpaa_ratingNC-17               -1.778e-01  8.796e-01  -0.202 0.839905    
## mpaa_ratingPG                  -5.916e-01  2.581e-01  -2.293 0.022245 *  
## mpaa_ratingPG-13               -8.009e-01  2.683e-01  -2.985 0.002958 ** 
## mpaa_ratingR                   -5.131e-01  2.597e-01  -1.976 0.048673 *  
## mpaa_ratingUnrated              1.471e-01  3.526e-01   0.417 0.676637    
## runtime                         5.503e-03  2.472e-03   2.226 0.026401 *  
## best_pic_nomyes                 3.846e-01  1.992e-01   1.931 0.054054 .  
## dvd_rel_month                   1.902e-02  1.048e-02   1.814 0.070152 .  
## best_dir_winyes                 2.028e-01  1.407e-01   1.441 0.150160    
## dvd_rel_year                   -1.084e-02  1.043e-02  -1.039 0.299084    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8385 on 550 degrees of freedom
## Multiple R-squared:  0.3888, Adjusted R-squared:  0.3644 
## F-statistic: 15.91 on 22 and 550 DF,  p-value: < 2.2e-16

As a result, we come up with the model imdb_rating ~ imdb_num_votes + genre + thtr_rel_year + mpaa_rating + runtime + best_pic_nom + dvd_rel_month + best_dir_win + dvd_rel_year with an adjusted R2 of 0.3644, which is slightly higher than that of the full model (0.3605). However, we can see that some of the variables have a p-value above the significance level of 5%. The position of a variable in the formula reflects its impact on the adjusted R2 (the earlier it was added, the greater its contribution).

Let’s try a backward elimination technique and compare the resulting model with the model above. Backward elimination starts with the model that includes all potential predictor variables. Variables are eliminated one-at-a-time from the model until we cannot improve the adjusted R2. At each step, we eliminate the variable that leads to the largest improvement in adjusted R2.
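Backward elimination can be sketched similarly; an illustrative loop on R’s built-in mtcars data (names are placeholders, not our model):

```r
# Illustrative backward elimination by adjusted R^2 on the built-in mtcars data
response <- "mpg"
vars     <- c("wt", "hp", "cyl", "drat", "qsec")

adj_r2  <- function(v) summary(lm(reformulate(v, response), data = mtcars))$adj.r.squared
current <- adj_r2(vars)  # start from the full model

repeat {
  if (length(vars) == 1) break
  # Adjusted R^2 of each model with one variable dropped
  scores <- sapply(vars, function(v) adj_r2(setdiff(vars, v)))
  if (max(scores) <= current) break  # dropping anything makes the model worse
  current <- max(scores)
  vars <- setdiff(vars, names(scores)[which.max(scores)])
}

vars     # predictors kept
current  # adjusted R^2 of the final model
```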

## 
## Call:
## lm(formula = bwrd_bestformula, data = ff_imdb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8427 -0.4093  0.0707  0.5695  2.0833 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.851e+01  1.688e+01   2.873 0.004222 ** 
## genreAnimation                 -2.238e-01  3.497e-01  -0.640 0.522592    
## genreArt House & International  9.483e-01  2.786e-01   3.404 0.000713 ***
## genreComedy                    -2.578e-02  1.434e-01  -0.180 0.857393    
## genreDocumentary                1.020e+00  5.052e-01   2.019 0.043925 *  
## genreDrama                      6.829e-01  1.228e-01   5.558 4.25e-08 ***
## genreHorror                    -9.543e-02  2.167e-01  -0.440 0.659819    
## genreMusical & Performing Arts  1.193e+00  3.186e-01   3.745 0.000200 ***
## genreMystery & Suspense         4.269e-01  1.594e-01   2.678 0.007637 ** 
## genreOther                      4.683e-01  2.467e-01   1.898 0.058179 .  
## genreScience Fiction & Fantasy -9.631e-02  3.166e-01  -0.304 0.761131    
## runtime                         5.503e-03  2.472e-03   2.226 0.026401 *  
## mpaa_ratingNC-17               -1.778e-01  8.796e-01  -0.202 0.839905    
## mpaa_ratingPG                  -5.916e-01  2.581e-01  -2.293 0.022245 *  
## mpaa_ratingPG-13               -8.009e-01  2.683e-01  -2.985 0.002958 ** 
## mpaa_ratingR                   -5.131e-01  2.597e-01  -1.976 0.048673 *  
## mpaa_ratingUnrated              1.471e-01  3.526e-01   0.417 0.676637    
## thtr_rel_year                  -1.062e-02  4.774e-03  -2.225 0.026497 *  
## dvd_rel_year                   -1.084e-02  1.043e-02  -1.039 0.299084    
## dvd_rel_month                   1.902e-02  1.048e-02   1.814 0.070152 .  
## imdb_num_votes                  3.459e-06  3.537e-07   9.779  < 2e-16 ***
## best_pic_nomyes                 3.846e-01  1.992e-01   1.931 0.054054 .  
## best_dir_winyes                 2.028e-01  1.407e-01   1.441 0.150160    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8385 on 550 degrees of freedom
## Multiple R-squared:  0.3888, Adjusted R-squared:  0.3644 
## F-statistic: 15.91 on 22 and 550 DF,  p-value: < 2.2e-16

We ended up with the same model using the backward elimination technique. We should now run diagnostics for the following predictors included in the model:

genre, runtime, mpaa_rating, thtr_rel_year, dvd_rel_year, dvd_rel_month, imdb_num_votes, best_pic_nom, best_dir_win

Model diagnostics

To assess whether the multiple regression model is reliable, we need to check for:

  1. nearly normal residuals,
  2. constant variability of residuals,
  3. the residuals are independent, and
  4. each variable is linearly related to the outcome.
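These four checks can be sketched with base graphics; the lm fit on mtcars below is a hypothetical stand-in for our movie-rating model:

```r
# Hypothetical lm fit on mtcars standing in for our movie-rating model
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# 1. Nearly normal residuals: normal probability (Q-Q) plot
qqnorm(res); qqline(res)

# 2. Constant variability: absolute residuals against fitted values
plot(fitted(fit), abs(res), xlab = "Fitted values", ylab = "|Residuals|")

# 3. Independence: residuals in order of data collection
plot(res, type = "b", xlab = "Order of collection", ylab = "Residuals")

# 4. Linearity: residuals against each numerical predictor
plot(mtcars$wt, res); plot(mtcars$hp, res)
```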

Let’s take a look at the Normal probability plot of residuals for our model.

We can see some deviations from the normal model. They are not extreme, though, and there are no outliers that might be a cause for concern. In a normal probability plot of residuals, we tend to be most worried about residuals that appear to be outliers, since these indicate long tails in the distribution of residuals.

Absolute values of residuals against fitted values. The following plot is helpful to check the condition that the variance of the residuals is approximately constant.

We don’t see any obvious deviations from constant variance in this plot. As there are not many movies with IMDb ratings around 8, 9, or 10 in the data set, we don’t know if the variance for these values remains constant. However, there is no evidence that it doesn’t.

Residuals in order of their data collection. Such a plot is helpful in identifying any connection between cases that are close to one another. If it seems that consecutive observations tend to be close to each other, this indicates the independence assumption of the observations would fail. We know that the data set represents a random selection of movies, so we don’t expect any problems here.

As expected, here we see no structure that indicates a problem.

The last thing we need to check is the residuals against each predictor variable. The first row of the graphics below shows the residual plots:

Let’s look at the categorical variables first. For the genre variable, we see many deviations from constant variability between genres. The variability among the mpaa_rating groups seems somewhat more constant. Unrated movies (there are only 16 of them in the data frame) have noticeably less variability than the others, and NC-17 shows almost none (there are only two movies of this category in the data set). Apart from that, this variable looks fine. Both the best_pic_nom and best_dir_win variables show non-constant variability (the latter to a lesser extent). On the one hand, this can be explained by the much smaller number of movies and directors that have been nominated for an Oscar; however, the real reason is not clear from the data.

The numerical variables show more uniformity overall. With the theatrical release year (thtr_rel_year) and DVD release year (dvd_rel_year), we don’t see any structure (the distribution around 0 seems random). There might be some remaining structure in the DVD release month variable, though: we can see a slight ‘wave’ going up and down. There’s clear ‘funneling’ in both the runtime and imdb_num_votes variables. For instance, predictions for movies with a low number of IMDb votes are less accurate (larger differences between predicted and observed values) than for those with many votes. This is not unexpected, as a point estimate gets closer to the actual population parameter as the sample size increases. It’s harder to explain why the model’s accuracy drops for shorter movies, though.

It’s worth mentioning that the following variables have p-values above the significance level of 5%: best_dir_win, best_pic_nom, dvd_rel_year, and dvd_rel_month.

Interpretation of model coefficients

The table below shows the coefficients (or point estimates of the coefficients) for each predictor in our model.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.5063482 16.8831814 2.8730573 0.0042222
genreAnimation -0.2237513 0.3497397 -0.6397652 0.5225916
genreArt House & International 0.9482872 0.2785873 3.4039143 0.0007127
genreComedy -0.0257799 0.1433986 -0.1797776 0.8573934
genreDocumentary 1.0201916 0.5051834 2.0194481 0.0439252
genreDrama 0.6828532 0.1228487 5.5584900 0.0000000
genreHorror -0.0954325 0.2166944 -0.4404014 0.6598194
genreMusical & Performing Arts 1.1931570 0.3186205 3.7447588 0.0001996
genreMystery & Suspense 0.4268653 0.1594225 2.6775726 0.0076373
genreOther 0.4682855 0.2466858 1.8983072 0.0581788
genreScience Fiction & Fantasy -0.0963062 0.3166432 -0.3041474 0.7611306
runtime 0.0055027 0.0024717 2.2262654 0.0264007
mpaa_ratingNC-17 -0.1777673 0.8795546 -0.2021106 0.8399050
mpaa_ratingPG -0.5916361 0.2580573 -2.2926539 0.0222446
mpaa_ratingPG-13 -0.8008963 0.2682718 -2.9853911 0.0029582
mpaa_ratingR -0.5131497 0.2597135 -1.9758299 0.0486734
mpaa_ratingUnrated 0.1471493 0.3526396 0.4172796 0.6766367
thtr_rel_year -0.0106217 0.0047742 -2.2248402 0.0264969
dvd_rel_year -0.0108412 0.0104305 -1.0393807 0.2990844
dvd_rel_month 0.0190193 0.0104821 1.8144594 0.0701517
imdb_num_votes 0.0000035 0.0000004 9.7788640 0.0000000
best_pic_nomyes 0.3845828 0.1992098 1.9305415 0.0540535
best_dir_winyes 0.2027599 0.1407099 1.4409774 0.1501601

In this case, the intercept is the predicted imdb_rating when all predictors are zero; it has no practical interpretation here (a release year of 0, for instance, is impossible). The coefficient of 0.0055 for the runtime variable means that, holding the other variables constant, each additional minute of runtime is associated with an average IMDb rating difference of +0.0055.
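This ‘holding the other variables constant’ interpretation can be verified numerically; a sketch with a hypothetical fit on mtcars (the coefficient values differ from the table above):

```r
# Hypothetical fit on mtcars; 'wt' plays the role of a numerical predictor
# like runtime (coefficient values differ from the table above)
fit  <- lm(mpg ~ wt + hp, data = mtcars)
b_wt <- coef(fit)["wt"]

# Two cases identical except that wt differs by exactly 1 unit
newdata <- data.frame(wt = c(3, 4), hp = c(110, 110))
preds   <- predict(fit, newdata)

# Holding hp constant, the predicted difference equals the wt coefficient
diff(preds)
```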

We can see that some of the variables have a larger impact on the IMDb rating. For example, an MPAA rating of PG-13 (“Parents strongly cautioned: some material may be inappropriate for children under 13”) is associated with an IMDb rating 0.8 lower, while a movie in the Musical & Performing Arts category gets a rating 1.19 higher. The latter is the largest single factor.

The number of votes on IMDb (imdb_num_votes) seems to have the least impact on a movie’s IMDb rating: each additional vote adds only 0.0000035 to the rating, so a movie needs a million votes to see its rating increase by 3.5. Considering that a movie in our data set gets on average only 62,861 votes, the impact on the rating is not very large (0.22 on average).

As expected, both the DVD release year and month have just a tiny impact on the IMDb rating (if any, considering their p-values are above the significance level). A Best Picture nomination adds 0.38 to the rating.


Part 5: Prediction

To check our model’s prediction accuracy, we picked a movie that is not in the initial data set and was released in 2016. This movie is “La La Land” (IMDb link: www.imdb.com/title/tt3783958/). As of December 17, 2019, it has an IMDb rating of 8.0 and 456,399 votes. The movie was nominated for the Best Picture Oscar, and its director, Damien Chazelle, won the Best Director Oscar in 2017. The rest of the parameters have been combined in a data frame lalaland.

We use the predict function with both our model and the lalaland data as input:
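The call might look like the sketch below; the model object and new-data frame here (fit, newmovie) are hypothetical stand-ins for our fitted model and the lalaland frame:

```r
# Sketch of the prediction call; fit and newmovie are hypothetical stand-ins
fit <- lm(mpg ~ wt + hp, data = mtcars)
newmovie <- data.frame(wt = 2.8, hp = 160)

# interval = "prediction" returns fit/lwr/upr bounds for a single new case
predict(fit, newmovie, interval = "prediction", level = 0.95)
```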

##        fit      lwr      upr
## 1 7.345815 5.621457 9.070173

The model predicts an IMDb rating of 7.35 for this movie. The actual rating is 8.0, which means the model underestimated the real rating. The lwr and upr values show the lower and upper bounds of a 95% prediction interval: we’re 95% confident that the actual IMDb rating of this movie is between 5.62 and 9.07. The interval is quite wide. This, in fact, agrees with the adjusted R2 value of 0.3644, meaning that only 36.44% of the variation in the rating is explained by the model.

We used a movie released in 2016 even though our model is based on movies released before 2016. This implies extrapolation, which makes our prediction even less reliable.


Part 6: Conclusion

In this research, we used factual data on 600+ movies and built a multiple regression model that predicts the IMDb rating. The adjusted R2 was used as an estimate of explained variance. We tried two stepwise model selection techniques (forward selection and backward elimination), both of which arrived at the same model, which explains approximately 36.44% of the variation in movie popularity as measured by IMDb rating.

Answering the research question: we can predict the popularity of a movie without watching it, but only to a limited extent.

Fortunately for us, movie lovers, the popularity of a certain piece of art is based on something that is hard to measure and categorize. And this ‘something’ explains a good part of those remaining 63.56%.

H2O Machine Learning Tests

## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         6 days 22 hours 
##     H2O cluster timezone:       America/Phoenix 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.28.0.1 
##     H2O cluster version age:    1 month and 8 days  
##     H2O cluster name:           H2O_started_from_R_vk_jet_kvi422 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   0.83 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.2 (2019-12-12)
## 
## Warning in summary.H2OFrame(train): Approximated quantiles computed! If you are
## interested in exact quantiles, please pass the `exact_quantiles=TRUE` parameter.
##  imdb_rating     imdb_rating_rounded genre                   runtime      
##  Min.   :2.300   6:181               Drama             :235  Min.   : 68  
##  1st Qu.:5.871   7:145               Comedy            : 65  1st Qu.: 94  
##  Median :6.497   8: 61               Action & Adventure: 48  Median :105  
##  Mean   :6.394   5: 36               Mystery & Suspense: 39  Mean   :107  
##  3rd Qu.:7.099   4: 20               Horror            : 17  3rd Qu.:118  
##  Max.   :8.500   3:  3               Other             : 14  Max.   :194  
##  mpaa_rating  thtr_rel_year  thtr_rel_month   dvd_rel_year   dvd_rel_month   
##  R      :233  Min.   :1972   Min.   : 1.000   Min.   :1991   Min.   : 1.000  
##  PG-13  :105  1st Qu.:1991   1st Qu.: 4.000   1st Qu.:2001   1st Qu.: 3.000  
##  PG     : 84  Median :2000   Median : 6.000   Median :2003   Median : 6.000  
##  Unrated: 15  Mean   :1998   Mean   : 6.616   Mean   :2004   Mean   : 6.297  
##  G      : 11  3rd Qu.:2006   3rd Qu.:10.000   3rd Qu.:2007   3rd Qu.: 9.000  
##  NC-17  :  0  Max.   :2014   Max.   :12.000   Max.   :2015   Max.   :12.000  
##  imdb_num_votes   best_pic_nom best_actor_win best_actress_win best_dir_win
##  Min.   :   390   no :431      no :383        no :392          no :415     
##  1st Qu.:  5746   yes: 17      yes: 65        yes: 56          yes: 33     
##  Median : 18242                                                            
##  Mean   : 63912                                                            
##  3rd Qu.: 63989                                                            
##  Max.   :893008                                                            
##  top200_box
##  no :434   
##  yes: 14   
##            
##            
##            
## 
## [1] 448
## [1] 125
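The two counts printed above (448 and 125) are the sizes of the training and test sets, a roughly 80/20 split. A sketch of how such a split could be produced with `h2o.splitFrame` (the frame name `movies_prepared`, the ratio, and the seed are assumptions):

```r
# Convert the prepared data frame to an H2OFrame and split ~80/20.
movies_hex <- as.h2o(movies_prepared)   # hypothetical prepared data frame
splits <- h2o.splitFrame(movies_hex, ratios = 0.8, seed = 1)
train <- splits[[1]]
test  <- splits[[2]]
nrow(train)   # the report shows 448
nrow(test)    # the report shows 125
```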
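The model details below describe an 8-class deep learning classifier on `imdb_rating_rounded` with two hidden layers of 200 rectifier units each. A hedged sketch of how such a model might be fit (the frame name `train` and the epoch count, inferred from 4,480 training samples over 448 rows, are assumptions):

```r
# Two hidden layers of 200 Rectifier units, matching the layer table below.
dl_model <- h2o.deeplearning(
  y = "imdb_rating_rounded",
  training_frame = train,
  hidden = c(200, 200),
  activation = "Rectifier",
  epochs = 10,
  seed = 1
)
summary(dl_model)
```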
## Model Details:
## ==============
## 
## H2OMultinomialModel: deeplearning
## Model ID:  DeepLearning_model_R_1579320800507_105 
## Status of Neuron Layers: predicting imdb_rating_rounded, 8-class classification, multinomial distribution, CrossEntropy loss, 50,008 weights/biases, 596.1 KB, 4,480 training samples, mini-batch size 1
##   layer units      type dropout       l1       l2 mean_rate rate_rms momentum
## 1     1    40     Input  0.00 %       NA       NA        NA       NA       NA
## 2     2   200 Rectifier  0.00 % 0.000000 0.000000  0.203462 0.403377 0.000000
## 3     3   200 Rectifier  0.00 % 0.000000 0.000000  0.005006 0.002772 0.000000
## 4     4     8   Softmax      NA 0.000000 0.000000  0.134806 0.339789 0.000000
##   mean_weight weight_rms mean_bias bias_rms
## 1          NA         NA        NA       NA
## 2   -0.008759   0.094928  0.413450 0.036820
## 3   -0.005816   0.071364  0.981269 0.020033
## 4   -0.008355   0.392839 -0.012372 0.007795
## 
## 
## H2OMultinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
## 
## Training Set Metrics: 
## =====================
## 
## Extract training frame with `h2o.getFrame("RTMP_sid_9d06_6")`
## MSE: (Extract with `h2o.mse`) 0.3082136
## RMSE: (Extract with `h2o.rmse`) 0.5551699
## Logloss: (Extract with `h2o.logloss`) 0.9386761
## Mean Per-Class Error: 0.4638141
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##        2 3 4  5   6   7  8 9  Error        Rate
## 2      1 0 0  0   1   0  0 0 0.5000 =     1 / 2
## 3      0 0 0  1   1   1  0 0 1.0000 =     3 / 3
## 4      0 0 0  8  11   1  0 0 1.0000 =   20 / 20
## 5      0 0 0 28   6   2  0 0 0.2222 =    8 / 36
## 6      0 0 0 15 139  19  8 0 0.2320 =  42 / 181
## 7      0 0 0  8  44  71 22 0 0.5103 =  74 / 145
## 8      0 0 0  2   7   6 46 0 0.2459 =   15 / 61
## 9      0 0 0  0   0   0  0 0     NA =     0 / 0
## Totals 1 0 0 62 209 100 76 0 0.3638 = 163 / 448
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-8 Hit Ratios: 
##   k hit_ratio
## 1 1  0.636161
## 2 2  0.848214
## 3 3  0.957589
## 4 4  0.984375
## 5 5  0.997768
## 6 6  1.000000
## 7 7  1.000000
## 8 8  1.000000
## H2OMultinomialMetrics: deeplearning
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.4921508
## RMSE: (Extract with `h2o.rmse`) 0.7015346
## Logloss: (Extract with `h2o.logloss`) 1.909072
## Mean Per-Class Error: 0.809222
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##        2 3 4  5  6  7  8 9  Error       Rate
## 2      0 0 0  0  1  0  0 0 1.0000 =    1 / 1
## 3      0 0 0  1  2  0  0 0 1.0000 =    3 / 3
## 4      0 0 0  2  5  0  0 0 1.0000 =    7 / 7
## 5      0 0 0  3  9  1  0 0 0.7692 =  10 / 13
## 6      0 0 0  9 19  8  2 0 0.5000 =  19 / 38
## 7      0 0 0  1 19 10 10 0 0.7500 =  30 / 40
## 8      0 0 0  1  4  5 12 0 0.4545 =  10 / 22
## 9      0 0 0  0  0  0  1 0 1.0000 =    1 / 1
## Totals 0 0 0 17 59 24 25 0 0.6480 = 81 / 125
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-8 Hit Ratios: 
##   k hit_ratio
## 1 1  0.352000
## 2 2  0.696000
## 3 3  0.848000
## 4 4  0.944000
## 5 5  0.968000
## 6 6  0.984000
## 7 7  0.992000
## 8 8  1.000000
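The test set metrics above can be obtained by scoring the fitted model on the held-out frame. A minimal sketch (assuming the model object `dl_model` and test frame `test` from earlier steps):

```r
# Score the model on the held-out test frame and extract the same
# metrics the report shows: confusion matrix and hit ratio table.
perf <- h2o.performance(dl_model, newdata = test)
h2o.confusionMatrix(perf)
h2o.hit_ratio_table(perf)
```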
## 
##                                               model_id mean_per_class_error
## 1  DeepLearning_grid__1_AutoML_20200124_195055_model_1            0.6382081
## 2                         GBM_4_AutoML_20200124_195055            0.6435266
## 3  DeepLearning_grid__1_AutoML_20200124_195055_model_3            0.6462788
## 4          GBM_grid__1_AutoML_20200124_195055_model_23            0.6514436
## 5  DeepLearning_grid__2_AutoML_20200124_195055_model_1            0.6527988
## 6                         GBM_3_AutoML_20200124_195055            0.6565103
## 7          GBM_grid__1_AutoML_20200124_195055_model_17            0.6584141
## 8          GBM_grid__1_AutoML_20200124_195055_model_18            0.6594116
## 9                         GBM_2_AutoML_20200124_195055            0.6595320
## 10                        DRF_1_AutoML_20200124_195055            0.6620231
## 11         GBM_grid__1_AutoML_20200124_195055_model_16            0.6638960
## 12                        XRT_1_AutoML_20200124_195055            0.6656215
## 13         GBM_grid__1_AutoML_20200124_195055_model_26            0.6669583
## 14          GBM_grid__1_AutoML_20200124_195055_model_2            0.6675738
## 15      XGBoost_grid__1_AutoML_20200124_195055_model_5            0.6679752
## 16         GBM_grid__1_AutoML_20200124_195055_model_21            0.6684557
## 17      XGBoost_grid__1_AutoML_20200124_195055_model_4            0.6690985
## 18 DeepLearning_grid__1_AutoML_20200124_195055_model_2            0.6693460
## 19      XGBoost_grid__1_AutoML_20200124_195055_model_7            0.6704557
## 20 DeepLearning_grid__1_AutoML_20200124_195055_model_4            0.6714398
## 21         GBM_grid__1_AutoML_20200124_195055_model_10            0.6734082
## 22          GBM_grid__1_AutoML_20200124_195055_model_4            0.6739245
## 23                    XGBoost_1_AutoML_20200124_195055            0.6746378
## 24                        GBM_1_AutoML_20200124_195055            0.6758915
## 25 DeepLearning_grid__3_AutoML_20200124_195055_model_1            0.6797380
## 26         GBM_grid__1_AutoML_20200124_195055_model_28            0.6799795
## 27          GBM_grid__1_AutoML_20200124_195055_model_5            0.6811640
## 28                    XGBoost_3_AutoML_20200124_195055            0.6823559
## 29         GBM_grid__1_AutoML_20200124_195055_model_15            0.6825816
## 30 DeepLearning_grid__3_AutoML_20200124_195055_model_2            0.6841817
## 31         GBM_grid__1_AutoML_20200124_195055_model_19            0.6848488
## 32          GBM_grid__1_AutoML_20200124_195055_model_8            0.6887564
## 33      XGBoost_grid__1_AutoML_20200124_195055_model_2            0.6915332
## 34         GBM_grid__1_AutoML_20200124_195055_model_11            0.6922114
## 35         GBM_grid__1_AutoML_20200124_195055_model_29            0.6923116
## 36 DeepLearning_grid__2_AutoML_20200124_195055_model_2            0.6924489
## 37                    XGBoost_2_AutoML_20200124_195055            0.6963317
## 38         GBM_grid__1_AutoML_20200124_195055_model_24            0.7002770
## 39         GBM_grid__1_AutoML_20200124_195055_model_13            0.7005028
## 40 DeepLearning_grid__2_AutoML_20200124_195055_model_3            0.7026830
## 41         GBM_grid__1_AutoML_20200124_195055_model_20            0.7032008
## 42      XGBoost_grid__1_AutoML_20200124_195055_model_3            0.7065955
## 43               DeepLearning_1_AutoML_20200124_195055            0.7095069
## 44    StackedEnsemble_AllModels_AutoML_20200124_195055            0.7112976
## 45         GBM_grid__1_AutoML_20200124_195055_model_22            0.7138926
## 46      XGBoost_grid__1_AutoML_20200124_195055_model_6            0.7192156
## 47          GBM_grid__1_AutoML_20200124_195055_model_6            0.7195311
## 48         GBM_grid__1_AutoML_20200124_195055_model_12            0.7215755
## 49                        GBM_5_AutoML_20200124_195055            0.7216125
## 50         GBM_grid__1_AutoML_20200124_195055_model_27            0.7226234
## 51         GBM_grid__1_AutoML_20200124_195055_model_14            0.7235844
## 52          GBM_grid__1_AutoML_20200124_195055_model_9            0.7243380
## 53          GBM_grid__1_AutoML_20200124_195055_model_1            0.7245368
## 54 DeepLearning_grid__1_AutoML_20200124_195055_model_5            0.7247893
## 55          GBM_grid__1_AutoML_20200124_195055_model_3            0.7259896
## 56          GBM_grid__1_AutoML_20200124_195055_model_7            0.7300343
## 57      XGBoost_grid__1_AutoML_20200124_195055_model_1            0.7327777
## 58 DeepLearning_grid__3_AutoML_20200124_195055_model_3            0.7358188
## 59 StackedEnsemble_BestOfFamily_AutoML_20200124_195055            0.7500000
## 60                        GLM_1_AutoML_20200124_195055            0.7500000
## 61         GBM_grid__1_AutoML_20200124_195055_model_25            0.7520922
##      logloss      rmse       mse
## 1   2.874357 0.6821987 0.4653951
## 2   1.469201 0.6700009 0.4489012
## 3   1.565344 0.6396374 0.4091360
## 4   1.437569 0.6806231 0.4632477
## 5   3.359979 0.6843742 0.4683681
## 6   1.471811 0.6751921 0.4558844
## 7  22.859580 0.8145799 0.6635405
## 8   2.029109 0.8684317 0.7541736
## 9   1.509593 0.6783006 0.4600917
## 10  2.306275 0.6653441 0.4426828
## 11  1.485650 0.7155002 0.5119405
## 12  1.648765 0.6677945 0.4459494
## 13  1.771981 0.8248887 0.6804414
## 14  1.864230 0.6709628 0.4501910
## 15  1.388091 0.7085196 0.5020000
## 16  1.513849 0.6852810 0.4696101
## 17  1.372579 0.7008606 0.4912056
## 18  1.820122 0.6687782 0.4472642
## 19  1.371367 0.7003057 0.4904280
## 20  1.419923 0.6516111 0.4245970
## 21  1.794830 0.8300555 0.6889921
## 22  1.785640 0.8273814 0.6845600
## 23  1.329911 0.6860047 0.4706024
## 24  1.588559 0.6871091 0.4721189
## 25  1.431426 0.6407995 0.4106240
## 26  1.793407 0.8295642 0.6881767
## 27  1.831146 0.8364072 0.6995771
## 28  1.361331 0.7055882 0.4978547
## 29  1.879541 0.8456149 0.7150646
## 30  1.396300 0.6541300 0.4278861
## 31  1.790738 0.8290682 0.6873542
## 32  1.766026 0.8248022 0.6802987
## 33  1.434423 0.7365913 0.5425667
## 34  7.673091 0.7475792 0.5588746
## 35  1.438542 0.7306918 0.5339105
## 36  1.291925 0.6589075 0.4341590
## 37  1.387477 0.7210912 0.5199725
## 38  3.187138 0.7131369 0.5085642
## 39  1.472999 0.6875668 0.4727481
## 40  1.712796 0.6743905 0.4548026
## 41  1.813074 0.8339442 0.6954629
## 42  1.398919 0.7258960 0.5269250
## 43  1.343313 0.6800739 0.4625005
## 44  1.340312 0.6886199 0.4741974
## 45  1.807818 0.8312863 0.6910369
## 46  1.398088 0.7227731 0.5224009
## 47  1.902870 0.8496652 0.7219309
## 48  2.040000 0.8699202 0.7567612
## 49  1.403047 0.7152745 0.5116176
## 50  1.857853 0.8421254 0.7091753
## 51  1.777875 0.8273437 0.6844977
## 52  1.837808 0.8384675 0.7030278
## 53  1.414065 0.7220073 0.5212945
## 54  1.497135 0.6713075 0.4506537
## 55  1.905398 0.8499961 0.7224934
## 56  1.541743 0.7683763 0.5904021
## 57  1.632353 0.7944841 0.6312049
## 58  1.399945 0.7021186 0.4929705
## 59  1.411271 0.7174727 0.5147671
## 60  1.459866 0.7170132 0.5141079
## 61 20.216931 0.8216195 0.6750585
## 
## [61 rows x 5 columns]
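The leaderboard above ranks 61 candidate models by mean per-class error. A hedged sketch of an `h2o.automl` call that could produce such a leaderboard (the runtime limit, sort metric, and seed are assumptions not stated in the report):

```r
# Run AutoML on the training frame and print the full leaderboard,
# then evaluate the leading model on the test frame.
aml <- h2o.automl(
  y = "imdb_rating_rounded",
  training_frame = train,
  sort_metric = "mean_per_class_error",
  max_runtime_secs = 600,
  seed = 1
)
print(aml@leaderboard, n = nrow(aml@leaderboard))
h2o.performance(aml@leader, newdata = test)
```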
## H2OMultinomialMetrics: deeplearning
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.5255961
## RMSE: (Extract with `h2o.rmse`) 0.7249801
## Logloss: (Extract with `h2o.logloss`) 4.462709
## Mean Per-Class Error: 0.7976333
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##        2 3 4 5  6  7  8 9  Error       Rate
## 2      0 0 0 0  1  0  0 0 1.0000 =    1 / 1
## 3      0 0 0 0  3  0  0 0 1.0000 =    3 / 3
## 4      0 0 1 1  5  0  0 0 0.8571 =    6 / 7
## 5      0 0 0 0 10  3  0 0 1.0000 =  13 / 13
## 6      0 1 0 1 25  8  3 0 0.3421 =  13 / 38
## 7      0 0 0 0 18 20  2 0 0.5000 =  20 / 40
## 8      0 0 0 0  4 11  7 0 0.6818 =  15 / 22
## 9      0 0 0 0  0  0  1 0 1.0000 =    1 / 1
## Totals 0 1 1 2 66 42 13 0 0.5760 = 72 / 125
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-8 Hit Ratios: 
##   k hit_ratio
## 1 1  0.424000
## 2 2  0.688000
## 3 3  0.880000
## 4 4  0.936000
## 5 5  0.976000
## 6 6  0.992000
## 7 7  1.000000
## 8 8  1.000000

Here, we try to obtain a better model by training on all variables (excluding those related to Rotten Tomatoes), without manually dropping predictors as we did for the linear regression tests.
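A sketch of this second AutoML run on all predictors. The Rotten Tomatoes variables excluded here (`critics_score`, `critics_rating`, `audience_score`, `audience_rating`) follow the standard codebook for this dataset; the exact exclusion list beyond them, the runtime limit, and the seed are assumptions:

```r
# Use every column except the response, the raw rating it was derived
# from, and the Rotten Tomatoes score variables.
predictors <- setdiff(
  names(train),
  c("imdb_rating_rounded", "imdb_rating",
    "critics_score", "critics_rating",
    "audience_score", "audience_rating")
)
aml_all <- h2o.automl(
  x = predictors,
  y = "imdb_rating_rounded",
  training_frame = train,
  max_runtime_secs = 600,
  seed = 1
)
```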

  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |======================================================================| 100%
##                                               model_id mean_per_class_error
## 1                     XGBoost_3_AutoML_20200124_195958            0.7886034
## 2       XGBoost_grid__1_AutoML_20200124_195958_model_2            0.7917549
## 3                     XGBoost_1_AutoML_20200124_195958            0.7924184
## 4                DeepLearning_1_AutoML_20200124_195958            0.8036197
## 5       XGBoost_grid__1_AutoML_20200124_195958_model_3            0.8065900
## 6       XGBoost_grid__1_AutoML_20200124_195958_model_1            0.8096159
## 7           GBM_grid__1_AutoML_20200124_195958_model_5            0.8099997
## 8  DeepLearning_grid__1_AutoML_20200124_195958_model_1            0.8107425
## 9       XGBoost_grid__1_AutoML_20200124_195958_model_5            0.8138672
## 10                        XRT_1_AutoML_20200124_195958            0.8139340
## 11                        GBM_1_AutoML_20200124_195958            0.8159317
## 12                        GBM_2_AutoML_20200124_195958            0.8258173
## 13                        GBM_3_AutoML_20200124_195958            0.8269731
## 14                        GBM_4_AutoML_20200124_195958            0.8295011
## 15                        DRF_1_AutoML_20200124_195958            0.8340427
## 16                    XGBoost_2_AutoML_20200124_195958            0.8343168
## 17          GBM_grid__1_AutoML_20200124_195958_model_4            0.8387184
## 18          GBM_grid__1_AutoML_20200124_195958_model_1            0.8433453
## 19          GBM_grid__1_AutoML_20200124_195958_model_3            0.8434071
## 20          GBM_grid__1_AutoML_20200124_195958_model_2            0.8466630
## 21      XGBoost_grid__1_AutoML_20200124_195958_model_4            0.8484848
## 22                        GBM_5_AutoML_20200124_195958            0.8592283
## 23                        GLM_1_AutoML_20200124_195958            0.8750000
## 24    StackedEnsemble_AllModels_AutoML_20200124_195958            0.8750000
## 25 StackedEnsemble_BestOfFamily_AutoML_20200124_195958            0.8750000
##      logloss      rmse       mse
## 1   1.386488 0.7127303 0.5079845
## 2   1.372170 0.7064156 0.4990229
## 3   1.344066 0.6918348 0.4786353
## 4   1.389324 0.6802779 0.4627780
## 5   1.414370 0.7251060 0.5257788
## 6   1.418984 0.7133923 0.5089285
## 7   2.107188 0.7266780 0.5280609
## 8   1.751122 0.7025074 0.4935166
## 9   1.466230 0.7377222 0.5442340
## 10  1.864578 0.6796911 0.4619799
## 11  1.808830 0.7335619 0.5381130
## 12  1.836281 0.7368593 0.5429616
## 13  1.840641 0.7429918 0.5520368
## 14  1.852366 0.7343269 0.5392360
## 15  3.498861 0.6991648 0.4888314
## 16  1.469833 0.7407563 0.5487199
## 17  1.937663 0.8500078 0.7225132
## 18 13.360339 0.7987343 0.6379766
## 19  1.869626 0.8401822 0.7059061
## 20  1.833578 0.7584323 0.5752195
## 21  1.709908 0.8122720 0.6597858
## 22  1.750916 0.7596533 0.5770732
## 23  1.500676 0.7341346 0.5389537
## 24  1.467663 0.7344108 0.5393593
## 25  1.464898 0.7341335 0.5389519
## 
## [25 rows x 5 columns]
## H2OMultinomialMetrics: xgboost
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.4869021
## RMSE: (Extract with `h2o.rmse`) 0.6977837
## Logloss: (Extract with `h2o.logloss`) 1.313213
## Mean Per-Class Error: 0.6422532
## R^2: (Extract with `h2o.r2`) 0.6475947
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##        2 3 4 5   6  7  8 9  Error        Rate
## 2      0 0 0 0   1  0  0 0 1.0000 =     1 / 1
## 3      0 0 0 0   2  0  0 0 1.0000 =     2 / 2
## 4      0 0 0 0   9  1  0 0 1.0000 =   10 / 10
## 5      0 0 0 2  17  1  0 0 0.9000 =   18 / 20
## 6      0 0 0 0  51 11  0 0 0.1774 =   11 / 62
## 7      0 0 0 0  35 26  5 0 0.6061 =   40 / 66
## 8      0 0 0 0   8 12 24 0 0.4545 =   20 / 44
## 9      0 0 0 0   0  0  0 0     NA =     0 / 0
## Totals 0 0 0 2 123 51 29 0 0.4976 = 102 / 205
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-8 Hit Ratios: 
##   k hit_ratio
## 1 1  0.502439
## 2 2  0.824390
## 3 3  0.912195
## 4 4  0.946341
## 5 5  0.985366
## 6 6  1.000000
## 7 7  1.000000
## 8 8  1.000000

We ran several machine learning experiments, and the most interesting result so far is that studio is the second most important variable for explaining movie popularity on IMDb. We excluded this variable from our linear regression analysis because it has over 100 distinct values, which would make the regression impractically slow. Machine learning methods don’t have this issue.
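To make the cost concrete: under dummy (one-hot) coding, a k-level factor adds k − 1 indicator columns to the regression design matrix, whereas tree-based learners can split on the raw categorical directly. A minimal Python sketch of this arithmetic (the report itself uses R/H2O; the function name is ours):

```python
# Illustrative sketch (not from the report): why a high-cardinality factor
# such as `studio` is costly for linear regression.
def dummy_columns(levels: int) -> int:
    """Number of indicator columns a k-level factor adds under dummy coding."""
    return levels - 1

# With 100+ studios, this one variable alone adds 100+ columns
# to the design matrix.
print(dummy_columns(100))  # 99
```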

The Gradient Boosting Machine (GBM) model returns the most impressive results on the training dataset, with zero errors. On the test data, however, it misclassifies around 50% of cases. It looks like we are dealing with overfitting here (or our interpretation of the output is incorrect). Further tests are needed.

Let’s build a standalone GBM model.

## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping bad and constant columns: [actor4, actor3, actor2, actor1, director, rt_url, imdb_url, title, actor5].
## Model Details:
## ==============
## 
## H2OMultinomialModel: gbm
## Model ID:  GBM_model_R_1579320800507_106 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             300                     2400              219243         1
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         1    1.00000          2          2     2.00000
## 
## 
## H2OMultinomialMetrics: gbm
## ** Reported on training data. **
## 
## Training Set Metrics: 
## =====================
## 
## Extract training frame with `h2o.getFrame("RTMP_sid_9d06_351")`
## MSE: (Extract with `h2o.mse`) 0.1201133
## RMSE: (Extract with `h2o.rmse`) 0.3465736
## Logloss: (Extract with `h2o.logloss`) 0.3736369
## Mean Per-Class Error: 0.07918575
## R^2: (Extract with `h2o.r2`) 0.9055787
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##        2 3  4  5   6   7  8 9  Error       Rate
## 2      2 0  0  0   0   0  0 0 0.0000 =    0 / 2
## 3      0 5  0  0   0   0  0 0 0.0000 =    0 / 5
## 4      0 0 15  1   4   0  0 0 0.2500 =   5 / 20
## 5      0 0  0 28   3   0  0 0 0.0968 =   3 / 31
## 6      0 0  1  0 151  13  0 0 0.0848 = 14 / 165
## 7      0 0  0  1  14 120  3 0 0.1304 = 18 / 138
## 8      0 0  0  0   4   2 78 0 0.0714 =   6 / 84
## 9      0 0  0  0   0   0  0 1 0.0000 =    0 / 1
## Totals 2 5 16 30 176 135 81 1 0.1031 = 46 / 446
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-8 Hit Ratios: 
##   k hit_ratio
## 1 1  0.896861
## 2 2  0.995516
## 3 3  1.000000
## 4 4  1.000000
## 5 5  1.000000
## 6 6  1.000000
## 7 7  1.000000
## 8 8  1.000000
## H2OMultinomialMetrics: gbm
## 
## Test Set Metrics: 
## =====================
## 
## MSE: (Extract with `h2o.mse`) 0.4604392
## RMSE: (Extract with `h2o.rmse`) 0.6785567
## Logloss: (Extract with `h2o.logloss`) 1.762086
## Mean Per-Class Error: 0.6648888
## R^2: (Extract with `h2o.r2`) 0.6667478
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##        2 3 4 5   6  7  8 9  Error        Rate
## 2      0 0 0 0   1  0  0 0 1.0000 =     1 / 1
## 3      0 0 1 1   0  0  0 0 1.0000 =     2 / 2
## 4      0 0 0 0  10  0  0 0 1.0000 =   10 / 10
## 5      0 0 2 2  13  2  1 0 0.9000 =   18 / 20
## 6      0 0 2 3  44 12  1 0 0.2903 =   18 / 62
## 7      0 0 0 1  31 29  5 0 0.5606 =   37 / 66
## 8      0 0 1 0   6 18 19 0 0.5682 =   25 / 44
## 9      0 0 0 0   0  0  0 0     NA =     0 / 0
## Totals 0 0 6 7 105 61 26 0 0.5415 = 111 / 205
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
## =======================================================================
## Top-8 Hit Ratios: 
##   k hit_ratio
## 1 1  0.458537
## 2 2  0.721951
## 3 3  0.848781
## 4 4  0.931707
## 5 5  0.980488
## 6 6  0.995122
## 7 7  1.000000
## 8 8  1.000000

We ran numerous experiments with GBM, and the best model we arrived at had an error rate of approximately 43.75% on the test dataset. That model used a maximum of 58 trees with a tree depth of 1 (stumps). However, re-splitting the data made this model noticeably less effective: it had around 66% errors on the test data. So we had to readjust the model, varying the number of trees, to bring the error rate back down.
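The generalization gap can be read directly off the standalone GBM's printed confusion-matrix totals above: 46 of 446 training rows and 111 of 205 test rows were misclassified. A quick stdlib-Python check (the report itself uses H2O in R):

```python
# Sketch: quantifying the train/test gap from the printed confusion-matrix
# totals of the standalone GBM above.
def error_rate(misclassified: int, total: int) -> float:
    return misclassified / total

train_err = error_rate(46, 446)   # ~0.1031, matches the training output
test_err = error_rate(111, 205)   # ~0.5415, matches the test output
gap = test_err - train_err        # ~0.44: a large gap, consistent with overfitting
print(round(train_err, 4), round(test_err, 4), round(gap, 2))
```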

One thing worth mentioning: the majority of the models we tested did a pretty good job in terms of deviation from the true value. In most cases the predicted value is off by just one point (the absolute residual equals 1): if the true value is 7, the model predicts either 6 or 8 (not 3 or 9). So it might be better to report a prediction interval of roughly ±1 point around the predicted value instead of counting an error each time the prediction misses by one. Further research is needed.
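This ±1 tolerance can be checked against the standalone GBM's test confusion matrix above. A stdlib-Python sketch (the report uses H2O in R; the matrix below is copied from the printed output, rows = actual ratings 2–9, columns = predicted ratings 2–9):

```python
# Sketch: scoring the test confusion matrix with a +/-1-point tolerance
# instead of exact match.
ratings = [2, 3, 4, 5, 6, 7, 8, 9]
confusion = [
    [0, 0, 0, 0,  1,  0,  0, 0],  # actual 2
    [0, 0, 1, 1,  0,  0,  0, 0],  # actual 3
    [0, 0, 0, 0, 10,  0,  0, 0],  # actual 4
    [0, 0, 2, 2, 13,  2,  1, 0],  # actual 5
    [0, 0, 2, 3, 44, 12,  1, 0],  # actual 6
    [0, 0, 0, 1, 31, 29,  5, 0],  # actual 7
    [0, 0, 1, 0,  6, 18, 19, 0],  # actual 8
    [0, 0, 0, 0,  0,  0,  0, 0],  # actual 9
]

total = sum(map(sum, confusion))
exact = sum(confusion[i][i] for i in range(len(ratings)))
within_one = sum(
    n
    for i, row in enumerate(confusion)
    for j, n in enumerate(row)
    if abs(ratings[i] - ratings[j]) <= 1
)

print(round(exact / total, 3))       # 0.459, matches the top-1 hit ratio
print(round(within_one / total, 3))  # 0.873: far better under the +/-1 tolerance
```

In other words, a prediction counted correct within ±1 point would be right about 87% of the time on this test set, versus about 46% under exact match.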